Abstract


The current study examined the degree to which the quality 
and characteristics of students’ essays could be modeled 
through dynamic natural language processing analyses. Un- 
dergraduate students (n = 131) wrote timed, persuasive es- 
says in response to an argumentative writing prompt. Recur- 
rent patterns of the words in the essays were then analyzed 
using recurrence quantification analysis (RQA). Results of 
correlation and regression analyses revealed that the RQA 
indices were significantly related to the quality of students’ 
essays, at both holistic and sub-scale levels (e.g., organiza- 
tion, cohesion). Additionally, these indices were able to ac- 
count for between 11% and 43% of the variance in students’ 
holistic and sub-scale essay scores. Overall, our results sug- 
gest that dynamic techniques can be used to improve natural 
language processing assessments of student essays. 


Introduction 


Adaptive educational technologies aim to improve student 
learning by relieving some of the pressures faced by in- 
structors, as well as by providing students with personal- 
ized practice opportunities (Crossley & McNamara, 2016; 
Nkambou, Mizoguchi, & Bourdeau, 2010). These technol- 
ogies increasingly rely on natural language processing 
(NLP) techniques to extract information about student per- 
formance and individual differences (Allen, Snow, & 
McNamara, 2015; Graesser, Chipman, Haynes, & Olney, 
2005; McNamara, Boonthum, Levinstein, & Millis, 2007). 
Compared to non-interactive learning tasks, these NLP- 
based tutoring systems have been shown to lead to signifi- 
cant gains in student learning (e.g., Graesser et al., 2005). 
A principal strength of the NLP techniques employed by 
these systems is their calculation of linguistic information 
across a wide variety of dimensions and window sizes 
(e.g., word sophistication, sentence complexity, document 
cohesion). For example, analyses conducted at the level of 
individual words can reveal information about the topics
(Blei, Ng, & Jordan, 2003) and concreteness of the lan- 
guage found in a document (Brysbaert, Warriner, & Ku- 
perman, 2014). Similar analyses can be conducted at the 
sentence, paragraph, and document levels to reveal infor- 
mation about student writing, such as lexical sophistication 
and cohesion (see McNamara et al., 2014 for a review). 

Importantly, once these indices have been calculated, 
they can be used to model information about students’ per- 
formance on learning tasks or individual differences in 
their knowledge and skills. For instance, researchers have 
used these indices to predict expert ratings of essay quality 
(e.g., Dikli, 2006; McNamara et al., 2015; Shermis & 
Burstein, 2003) as well as individual differences in their 
reading skills (Allen, Snow, & McNamara, 2015) and af- 
fective states (Allen et al., 2016; D’Mello, Dowell, & 
Graesser, 2009). Overall, extensive prior research indicates 
that these NLP techniques can produce powerful sources of 
data for the development of educational assessments and 
adaptive educational technologies. 

Despite their success, however, these NLP techniques 
have substantial room for improvement. One particularly 
salient weakness of these techniques relates to the fact that 
the majority of indices are calculated based on aggregate 
metrics of student language. For example, the lexical
sophistication of a student’s essay would be calculated as an
average value of the sophistication of all words produced 
by the student, but would not take into consideration how 
these words were distributed throughout the essay. As a 
consequence, these indices may miss out on important nu- 
ances in the structure of student writing. 


Dynamic Language Analyses 


In the current study, we address this gap by conducting 
dynamic computational analyses of the words in students’ 
essays. Analyses of the dynamic patterns in students’
language provide a method through which researchers can
model the ways in which language is structured. NLP
analyses traditionally calculate aggregate measures of linguistic
features over time, which potentially miss out on important 
information about language structure. Dynamic techniques, 
on the other hand, consider time to be of critical im- 
portance and intentionally factor temporal patterns into 
analyses. In this way, dynamic methodologies more appro- 
priately account for the complexity that is inherent in lan- 
guage as it unfolds over time. Although not commonly 
applied to language, dynamical techniques have been pre- 
viously used in a variety of scientific domains as a means 
of characterizing human behavior (e.g., Anderson, Bischof, 
Laidlaw, Risko, & Kingstone, 2013; Dale & Spivey, 2005; 
Shockley, Santana, & Fowler, 2003). 

To illustrate the purpose of these dynamic language 
analyses, consider a student, Josie, who has been asked to 
write a persuasive essay that responds to the question: Do 
people achieve more success by cooperation or by compe- 
tition? How might the words that Josie uses change over 
the course of the essay that she produces? If Josie has less 
knowledge of the topic (and consequently less evidence to 
substantiate her claims), she might simply repeat similar 
words and phrases throughout her essay without bringing 
in outside information. On the other hand, if Josie has high 
knowledge on this topic, she might be more likely to bring 
in outside information and only repeat certain key words 
and phrases throughout the essay. 

This example highlights important differences that must 
be considered when modeling the writing processes en- 
gaged by students, which may ultimately contribute to 
more nuanced assessments of their performance. For ex- 
ample, while surface-level features of a student’s essay 
may be able to be modeled with more traditional, static 
NLP metrics (e.g., word frequency), the coherence of their 
writing may require the writer to distribute topical infor- 
mation and outside evidence in specific ways. The distribu- 
tions of this information may therefore be missed in the 
absence of dynamical analyses. 


Recurrence Quantification Analysis 


In the current paper, we use Recurrence Quantification 
Analysis (RQA) to quantify the extent to which recurrent 
patterns in students’ persuasive essays relate to expert rat- 
ings of their quality and characteristics. RQA is a nonlinear 
technique that provides information about patterns of re- 
petitive behavior in continuous or categorical time series 
(Marwan, Romano, Thiel, & Kurths, 2007). Similar to 
many techniques used in dynamical systems theory re- 
search, this technique has been used in a variety of do- 
mains to characterize temporal patterns of human and non- 
human behavior (Dale & Spivey, 2005; Marwan, Wessel, 
Meyerfeldt, Schirdewan, & Kurths, 2002). For instance, 
RQA has been used to characterize heart rate variability
(Marwan et al., 2002), postural fluctuations (Riley,
Balasubramaniam, & Turvey, 1999), and eye movements
(Anderson et al., 2013).

Recently, researchers have demonstrated that RQA can 
be applied to categorical data sets and, consequently, be 
used to provide information about human language (Dale 
& Spivey, 2005). The fact that this technique can be ap- 
plied to both continuous and categorical data sets may be 
particularly important for the study of natural language, 
because it can measure multiple levels of the text, rather 
than relying on unidimensional analyses. 

RQA begins with the construction of a recurrence plot,
which is a visualization of a matrix where the
individual elements represent points in a time series that 
are visited more than once. Therefore, the recurrence plot 
represents the times in which a dynamical system visits the 
same area in a phase space (Marwan et al., 2007). Each 
point in the plot represents a particular state that is revisit- 
ed by the system (e.g., a word). If multiple points occur 
together, they form diagonal lines; these lines represent 
times when the system revisits an entire sequence of states. 
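
The construction described above can be sketched in a few lines of Python. This is a minimal illustration for a categorical series, not the implementation used in the study; the function names and the example series are hypothetical.

```python
def recurrence_matrix(series):
    """Return a square 0/1 matrix where cell (i, j) is 1 when the
    states at positions i and j of the series are identical."""
    n = len(series)
    return [[1 if series[i] == series[j] else 0 for j in range(n)]
            for i in range(n)]


def render(matrix):
    """Draw the matrix as text: '#' for a recurrent point, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in matrix)


# Hypothetical coded word series in which the sequence (1, 2) occurs twice.
codes = [1, 2, 3, 1, 2, 4]
plot = recurrence_matrix(codes)
print(render(plot))
```

In the printed plot, the points at (0, 3) and (1, 4) sit next to each other on a diagonal, forming a line of length two that reflects the repeated sequence 1, 2. The main diagonal is always filled, because every position trivially matches itself.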

After the recurrence plots are generated, they can be
quantified. RQA calculates numerous indices that characterize
recurrent patterns in a particular system (e.g., a text) to
allow for statistical comparisons of multiple systems (see
Coco & Dale, 2013, and Zbilut & Webber, 1992, for more
information).


Current Study 


The current study investigates whether and how information
about students’ writing performance can be modeled
through dynamical analyses of their word use. To this
end, we use RQA to calculate seven indices based on the 
temporal distributions of students’ word use. Our aim is to 
then use these indices to model the holistic quality and 
characteristics of the essays. 

We collected timed, persuasive essays written by under- 
graduate students and scored by expert human raters. We 
hypothesized that the RQA indices would provide mean- 
ingful information about the writing processes enacted by 
students, which would subsequently relate to the quality 
and characteristics of their essays. 


Methods 


Participants 


Undergraduate students (n = 131) from a public university 
in the United States participated in the study for course 
credit. On average, the students were 19.8 years in age, 
with 44.3% identifying as female, 64.1% Caucasian, 14.5% 
Asian, 7.6% African American, 7.6% Hispanic, and 6.1% 
“Other.” 


Data Collection Procedure 


Each student wrote a timed (25-minute), persuasive essay 
in response to a Scholastic Achievement Test (SAT) style 
prompt. The completed essays contained an average of 
412.3 words (SD = 159.9, min = 47.0, max = 980.0). 


Essay Scoring 


Students’ essays were assessed by two independent pairs of 
expert human raters. These raters had previous experience 
scoring academic essays and were compensated for their 
time. The holistic grading rubric was on a 6-point scale and 
based on a standardized rubric typically used for the as- 
sessment of SAT essays. The rubric contained sub-scale 
scores, which assessed the quality of the following aspects 
of the essay: introduction, body, conclusion, word choice, 
sentence structure, organization, topic and global cohe- 
sion, voice, and grammar, style and mechanics. 


Data Processing 


Students’ essays were cleaned in preparation for the RQA. 
All punctuation was removed and the words were convert- 
ed to lower case and stemmed. 

Once the essays were cleaned, the words were converted 
into series of categorical numeric codes, wherein the codes 
represented the individual word types (i.e., the unique 
words) in each essay. For instance, the sentence, “Dogs eat 
dog food.” would be converted to the series: {1, 2, 1, 3}. 
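
A rough sketch of this cleaning and coding step is shown below. The stemmer here is a deliberately crude, hypothetical stand-in (it only strips a trailing "s"); the study does not say which stemmer was used, and a real analysis would substitute an established one. Only ASCII punctuation is stripped in this sketch.

```python
import string


def naive_stem(word):
    """Crude stand-in for a real stemmer (an assumption, not the
    study's method): drop one trailing 's', except after 'ss'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def encode_essay(text):
    """Strip punctuation, lower-case, stem, then number each word
    type in order of first appearance, starting at 1."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    codes, seen = [], {}
    for token in cleaned.lower().split():
        stem = naive_stem(token)
        if stem not in seen:
            seen[stem] = len(seen) + 1
        codes.append(seen[stem])
    return codes


print(encode_essay("Dogs eat dog food."))  # [1, 2, 1, 3]
```

Numbering word types by first appearance reproduces the series given in the example sentence; any other one-to-one labeling would yield the same recurrence structure.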


Recurrence Quantification Analysis 


The crqa R library (Coco & Dale, 2013) was used to gen- 
erate recurrence plots and calculate recurrence indices for 
the essays. The resulting indices are described in Table 1. 
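
To give a concrete sense of the quantities involved, the sketch below computes rough analogues of the seven indices directly from a coded word series. It is not the crqa implementation and makes several simplifying assumptions: only the half of the plot above the main diagonal is examined (the plot is symmetric), recurrence rate and determinism are expressed as proportions, and normalized entropy is taken as entropy divided by the log of the number of lines, which is one possible reading of the description in Table 1. All function and key names are hypothetical.

```python
import math
from collections import Counter


def diagonal_line_lengths(series, min_len=2):
    """Lengths of diagonal runs of recurrent points above the main
    diagonal; runs shorter than min_len are not counted as lines."""
    n = len(series)
    lengths = []
    for lag in range(1, n):
        run = 0
        for i in range(n - lag):
            if series[i] == series[i + lag]:
                run += 1
            else:
                if run >= min_len:
                    lengths.append(run)
                run = 0
        if run >= min_len:
            lengths.append(run)
    return lengths


def rqa_indices(series, min_len=2):
    """Illustrative analogues of the indices described in Table 1."""
    n = len(series)
    # Recurrent points above the main diagonal.
    points = sum(series[i] == series[j]
                 for i in range(n) for j in range(i + 1, n))
    cells = n * (n - 1) // 2
    lengths = diagonal_line_lengths(series, min_len)
    total = len(lengths)
    counts = Counter(lengths)
    # Shannon entropy of the distribution of line lengths.
    entropy = sum((c / total) * math.log(total / c)
                  for c in counts.values())
    return {
        "recurrence_rate": points / cells if cells else 0.0,
        "determinism": sum(lengths) / points if points else 0.0,
        "line_number": total,
        "max_line": max(lengths, default=0),
        "average_line": sum(lengths) / total if total else 0.0,
        "entropy": entropy,
        # Assumed normalization: divide by log(number of lines).
        "normalized_entropy": entropy / math.log(total) if total > 1 else 0.0,
    }


indices = rqa_indices([1, 2, 3, 1, 2, 4])
print(indices)
```

For this short hypothetical series there is a single diagonal line of length two, so every recurrent point lies on a line and the entropy of the line-length distribution is zero.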


Statistical Analyses 


To assess the degree to which the patterns of recurrence in 
students’ essays were associated with their quality, we cal- 
culated Pearson correlations and regression analyses be- 
tween students’ essay scores and the RQA indices. 

Normality of the indices was assessed with skew, kurto- 
sis, and visual data inspections. One index, Line Number, 
was strongly skewed; therefore, we calculated the log 
transformation for this index. 

Pearson correlations were used to assess relations be- 
tween word recurrence and essay scores. We calculated 
these correlations for students’ holistic essay scores, as 
well as the sub-scale scores. Multicollinearity was then 
assessed among the significantly or marginally significant- 
ly correlated indices (r > .90); in the case that two varia- 
bles demonstrated multicollinearity, the index with the 
highest correlation with the dependent variable was re- 
tained. 
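
The multicollinearity screen described above can be sketched as a simple greedy filter. This is an illustration under stated assumptions rather than the authors' procedure: it compares absolute correlations against the .90 threshold, and the index names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prune_collinear(indices, scores, threshold=0.90):
    """Visit indices from strongest to weakest correlation with the
    scores; keep an index only if it does not correlate above the
    threshold with an index that has already been kept."""
    ranked = sorted(indices,
                    key=lambda name: abs(pearson(indices[name], scores)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(indices[name], indices[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: index_a and index_b are nearly redundant.
scores = [1, 2, 3, 4, 5, 6]
candidates = {
    "index_a": [1, 2, 3, 4, 5, 6],
    "index_b": [1, 2, 3, 4, 5, 7],
    "index_c": [5, 1, 4, 2, 6, 3],
}
print(prune_collinear(candidates, scores))  # ['index_a', 'index_c']
```

Because index_a and index_b correlate at roughly .99, only the one more strongly related to the scores is retained. A skewed index could be log-transformed (for example with `math.log`) before being passed in.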

A stepwise linear regression analysis was conducted to
assess which of the significant RQA indices were most
predictive of essay scores. To avoid overfitting the model,
we chose a ratio of 15 essays to 1 predictor, which allowed
for a maximum of eight indices to be entered into the
model, given that there were 131 essays in the analysis.

Following this essay score analysis, similar follow-up
analyses were conducted using the RQA indices to
predict the sub-scale scores of the essays. For these
analyses, we followed the same procedure detailed above.


Table 1. Description of RQA Indices

Index               Description

Recurrence Rate     Density of points in the recurrence plot. This
                    metric represents the overall amount of
                    recurrence present in the recurrence plot,
                    regardless of the distributions of the points.

Determinism         Number of recurrent points that tend to fall
                    on diagonal lines (ignoring the LOI). This
                    metric provides information about the
                    distribution of recurrent points. Systems with
                    low determinism are considered less “ordered”
                    than highly deterministic systems.

Line Number         Number of lines in the recurrence plot.
                    Lines are defined as two or more consecutive
                    points in a recurrence plot.

Max Line            Length of the longest diagonal line in the
                    recurrence plot; therefore, this metric
                    reveals if a system revisits a long sequence of
                    states at a particular point in time.

Average Line        Average length of the diagonal lines in the
                    recurrence plot; this metric therefore
                    provides information about the average length
                    of sequences of states.

Entropy             Shannon entropy of the distribution of the
                    line lengths in the recurrence plot. Entropy
                    will be higher if the system revisits a wider
                    variety of state sequences over time.

Normalized Entropy  Entropy variable normalized by the number
                    of lines in the plot.


Results 


Recurrence and Essay Quality 


Pearson correlations were calculated between the RQA 
indices and students’ holistic essay scores to examine the 
strength of the relationships among the variables. The re- 
sults of this analysis identified five RQA indices that 
demonstrated a significant or marginally significant rela- 
tion with holistic essay scores (see Table 2). 


A linear regression analysis was calculated with these
five RQA indices as predictors of students’ holistic essay
scores (score range: 1-6). This analysis yielded a significant
model, R² = .432, p < .001, with four variables that
combined to account for 43% of the variance in the essay
scores: Log of Line Number [β = 0.54, p < .001], Determinism
[β = -0.42, p < .001], Average Line [β = 0.49, p < .001],
and Max Line [β = -0.22, p < .05].


These correlation and regression analyses indicate better
writers produced essays that contained a higher quantity of
recurrent sequences (Log of Line Number), as well as longer
recurrent sequences, on average (i.e., they had a longer
Average Line Length, which indicates that lines of recurrent
sequences were longer on average). However, these
essays were also less deterministic overall, suggesting that
these essays contained a higher quantity of individual
recurrent points (words) than sequences of words.


Table 2. Correlations between RQA Indices and Essay Scores

Rows (RQA indices): Recurrence Rate, Determinism, Log of Line
Number, Max Line, Average Line, Entropy, Normalized Entropy.
Columns (essay scores): Holistic, Intro., Body, Conc., Word Choice,
Sentence Structure, Organization, Cohesion, Voice,
Grammar/Mechanics.
Cell shading marks significance: p < .001, p < .05, or marginal.
[Correlation values are not legible in the source.]


Recurrence and Essay Characteristics


Our second goal was to examine whether the RQA indices
were related to the characteristics of the students’ essays.
Pearson correlations were calculated between the RQA
indices and the nine sub-scale essay scores (see Table 2)
and followed by regression analyses. The statistical
information for these resulting models is provided below.


Introduction Quality

The regression yielded a significant model, R² = .117, p <
.001, with two significant predictors: Log of Line Number
[β = 0.27, p < .001] and Average Line [β = 0.17, p < .001].


Body Quality

The regression yielded a significant model, R² = .384, p <
.001, with three significant predictors: Log of Line Number
[β = 0.43, p < .001], Determinism [β = -0.49, p < .001], and
Average Line [β = 0.38, p < .001].


Conclusion Quality

The regression yielded a significant model, R² = .223, p <
.001, with one significant predictor: Log of Line Number
[β = 0.43, p < .001].


Word Choice

The regression yielded a significant model, R² = .114, p <
.001, with three significant predictors: Determinism
[β = -0.20, p < .05], Log of Line Number [β = 0.23, p < .05], and
Recurrence Rate [β = -0.19, p < .05].


Sentence Structure 

The regression yielded a significant model, R² = .215, p <
.001, with two significant predictors: Log of Line Number
[β = 0.41, p < .001] and Determinism [β = -0.30, p < .001].


Organization 

The regression yielded a significant model, R² = .164, p <
.001, with one significant predictor: Log of Line Number
[β = 0.41, p < .001].


Topic and Global Cohesion 
None of the RQA indices were entered into this analysis. 


Voice 

The regression yielded a significant model, R² = .250, p <
.001, with three significant predictors: Log of Line Number
[β = 0.34, p < .001], Determinism [β = -0.41, p < .001], and
Average Line [β = 0.29, p < .01].


Grammar, Style, and Mechanics 

The regression yielded a significant model, R² = .106, p <
.01, with two significant predictors: Entropy [β = 0.28,
p < .01] and Recurrence Rate [β = -0.27, p < .01].


The results of the sub-scale analyses indicate that the RQA 
indices were meaningfully related to the properties of stu- 
dents’ essays at multiple levels; yet, the sub-scale scores 
were more weakly related than the holistic essay scores. 
The regression analysis calculated for body quality was the 
strongest of the models and indicated that essays with 
higher-quality body paragraphs were related to longer 
lines, but lower determinism overall. Additionally, the re- 
gressions for all remaining sub-scales, except cohesion, 
were significant with RQA indices accounting for between 
11 and 25% of the variance. The topic and global cohesion 
score was not significantly related to any of the RQA indi- 
ces, indicating that human perceptions of cohesion were 
not related to recurrent word patterns in students’ writing. 


Discussion 


In the current study, we used dynamic methodologies to 
develop NLP assessments of students’ writing perfor- 
mance. In particular, our goal was to determine whether we 
could model the holistic and sub-scale scores of essays by 
calculating indices related to the temporal distribution of 
the words that they produced. Recurrence quantification 
analysis (RQA) was used to calculate indices related to the 
quantity, length, and distributions of these recurrent word 
patterns. The results revealed that the RQA indices were 
able to model 43% of the variance in students’ holistic es- 
say scores. Additionally, these indices were able to model 
specific characteristics of the essays at multiple levels. 

The essay score analyses revealed that five RQA indices 
were significantly or marginally significantly correlated 
with students’ holistic essay scores. This finding is promis- 
ing and indicates that the overall quality of students’ essays 
can be modeled in the ways in which they distribute the 
words throughout their writing. These initial analyses of 
essay score indicate that the length and variability of the 
word sequences that students produce may be an important 
indicator of their writing skill. In particular, higher-quality 
essays were characterized by longer word sequences, but 
also by a greater variability in these line lengths (Entropy) 
and lower Determinism overall. These analyses speak to 
the importance of accounting for temporal patterns in lan- 
guage. NLP techniques often rely on summative metrics of 
text features to characterize student writing; however, these 
results suggest that expanding these analyses to include 
temporal information can provide insights into the charac- 
teristics of quality writing. 

The essay characteristic analyses additionally revealed
similarities and differences between the RQA indices and 
the quality of students’ writing at multiple levels. Perfor- 
mance on all of the essay scores was related to a greater 
number of recurrent word sequences (Log of Line Num- 
ber), and all of the paragraph quality metrics demonstrated 
significant correlations with Average Line Length and 
Maximum Line Length. This finding suggests that writing 
quality at multiple levels can be characterized by a greater 
amount of recurrent information throughout the text, as 
well as longer sequences of these recurrent words. 

Beyond these similarities, the correlations were indica- 
tive of differences among the essay scores. Specifically, 
while the relations between the RQA indices and the essay 
scores were generally similar in their directionality, they 
largely differed in magnitude. For instance, the quality of 
students’ body paragraphs was significantly related to four 
of the seven RQA indices, whereas the topic and global 
cohesion score demonstrated only three marginally signifi- 
cant correlations with the indices. These results suggest 
that these recurrent word patterns can provide fine-grained 
information about writing quality that moves beyond holis- 
tic scores. 

The results of the current study provide initial evidence 
for the usefulness of dynamic analyses of language. How- 
ever, there remain a number of open questions to be an- 
swered in additional research. For instance, do these indi- 
ces map onto similar quality metrics across multiple text 
genres? Similarly, do the indices predict writing quality for 
different age groups and native languages? These questions 
and many more remain to be answered in future research. 

An important note is that our analyses only focused on 
the individual words in students’ essays. We did not ac- 
count for the various additional sources of information that 
are currently afforded in NLP analyses, such as parts-of- 
speech, semantic information, or sophistication. We strate- 
gically chose to focus this initial study on the individual 
words in order to provide a demonstration of the strength 
of RQA in the absence of this additional information. 
However, this study by no means represents the limit of its 
potential. RQA is a highly flexible technique that can be 
used to analyze any temporal data -- continuous or categor- 
ical. For instance, one could imagine examining recurrent 
patterns in the topics discussed in students’ essays, the 
parts-of-speech or the sophistication of the words. Future 
analyses such as these will no doubt provide important 
insights into the structure of student language. 

Overall, our results suggest that RQA can be utilized to 
guide dynamic assessments of students’ writing quality. 
Our eventual goal is to use these indices to develop more 
nuanced assessments of essay quality, which can then be 
used to drive formative feedback in adaptive educational 
technologies. Although this study provides only a first step
toward that goal, and considerable research remains
to be conducted, these results provide a foundation on
which to conduct research that considers the dynamic
nature of student language.